docs(operations): add containerized GPU workloads guide #555
Aleksei Sviridkin (lexfrei) wants to merge 1 commit into
Walkthrough
Adds a new operations guide documenting how to run containerized GPU workloads on Cozystack management nodes using the
Changes: GPU Container Workloads Documentation
Code Review
This pull request adds a new documentation page detailing how to run containerized GPU workloads using the container variant of the cozystack.gpu-operator package. The review feedback suggests specifying the cozy-system namespace in both the kubectl patch command and the Package resource manifest to ensure they are applied to the correct namespace.
    kubectl patch packages.cozystack.io cozystack.cozystack-platform --type=json \
      -p '[{"op": "add", "path": "/spec/components/platform/values/bundles/enabledPackages/-", "value": "cozystack.gpu-operator"}]'
In Cozystack, the Package resources (including cozystack.cozystack-platform) are typically located in the cozy-system namespace. Running kubectl patch without specifying the namespace will fail if the user's current context is set to another namespace (like default). Adding -n cozy-system ensures the command runs successfully.
Suggested change:
    - kubectl patch packages.cozystack.io cozystack.cozystack-platform --type=json \
    -   -p '[{"op": "add", "path": "/spec/components/platform/values/bundles/enabledPackages/-", "value": "cozystack.gpu-operator"}]'
    + kubectl patch packages.cozystack.io cozystack.cozystack-platform -n cozy-system --type=json \
    +   -p '[{"op": "add", "path": "/spec/components/platform/values/bundles/enabledPackages/-", "value": "cozystack.gpu-operator"}]'
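Whether a namespace flag applies at all depends on the scope of the `packages.cozystack.io` CRD. A quick way to confirm this before patching is to read the CRD's `.spec.scope` field. The sketch below assumes the CRD object is named `packages.cozystack.io` (taken from the command above); the helper names `namespace_flag_for_scope` and `crd_scope` are hypothetical.

```bash
# Hypothetical helper: map a CRD scope string to the kubectl flag it implies.
# "Namespaced" means -n <namespace> is required; "Cluster" means no flag.
namespace_flag_for_scope() {
  local scope="$1" namespace="$2"
  case "$scope" in
    Namespaced) printf -- '-n %s\n' "$namespace" ;;
    Cluster)    printf '\n' ;;
    *)          echo "unexpected scope: $scope" >&2; return 1 ;;
  esac
}

# Reads .spec.scope from a CRD; needs access to the cluster.
crd_scope() {
  kubectl get crd "$1" -o jsonpath='{.spec.scope}'
}
```

On a live cluster, `namespace_flag_for_scope "$(crd_scope packages.cozystack.io)" cozy-system` prints the flag to add, or an empty line when the resource is cluster-scoped.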
    apiVersion: cozystack.io/v1alpha1
    kind: Package
    metadata:
      name: cozystack.gpu-operator
    spec:
      variant: container
The Package resource needs to be created in the cozy-system namespace for the Cozystack operator to detect and reconcile it. Adding namespace: cozy-system to the metadata ensures it is applied to the correct namespace.
Suggested change:
    - apiVersion: cozystack.io/v1alpha1
    - kind: Package
    - metadata:
    -   name: cozystack.gpu-operator
    - spec:
    -   variant: container
    + apiVersion: cozystack.io/v1alpha1
    + kind: Package
    + metadata:
    +   name: cozystack.gpu-operator
    +   namespace: cozy-system
    + spec:
    +   variant: container
3170d45 to 8b83e54
8b83e54 to b9cae43
myasnikovdaniil left a comment
Thanks — this is a well-researched page and most of it checks out against the companion PR cozystack/cozystack#2766 and the platform chart. A few substantive items before merge.
Main blocker: the Fractional GPU sharing section directs users into a device-plugin registration conflict (see inline comment). HAMi does not reuse the operator's device plugin — it ships its own, and the auto-disable that prevents the clash only exists in the tenant kubernetes app chart, not on the management cluster. The container variant pins devicePlugin.enabled: true, so stacking cozystack.hami on top as written runs two plugins both registering nvidia.com/gpu.
Sequencing: cozystack/cozystack#2766 (which adds the container variant) is still open. This page documents a variant that doesn't exist yet — please hold merge until #2766 lands, or confirm both ship in the same release train.
Smaller accuracy/UX fixes inline. Recommendation: request changes.
    ## Fractional GPU sharing

    The `container` variant exposes whole GPUs through the upstream NVIDIA device plugin. To slice one GPU across multiple pods (memory and compute quotas per pod), enable HAMi on top; HAMi reuses the same device plugin layer and is wired in via the `cozystack.hami` package, which already depends on `cozystack.gpu-operator`. See [GPU Sharing with HAMi](/docs/next/kubernetes/gpu-sharing/) for the tenant Kubernetes flow; for management-cluster workloads the wiring is the same package set with HAMi enabled.
- "HAMi reuses the same device plugin layer" is wrong. HAMi ships its own device plugin + scheduler extender. The page you link to states the opposite: "When HAMi is enabled, GPU Operator's built-in device plugin is automatically disabled to avoid resource registration conflicts."
- That auto-disable only lives in the tenant `kubernetes` app chart (`packages/apps/kubernetes/tests/gpu_operator_hami_test.yaml`: "should disable devicePlugin when hami is enabled"). The management-cluster `cozystack.hami` PackageSource only declares `dependsOn: cozystack.gpu-operator` (install ordering); `packages/system/hami/values.yaml` does not touch the operator's device plugin.
- The `container` variant pins `devicePlugin.enabled: true` (`values-container.yaml` in #2766). Stacking `cozystack.hami` on top, as written, runs two device plugins both registering `nvidia.com/gpu`, exactly the conflict the HAMi doc warns about.
Suggested rewrite:

    The `container` variant exposes whole GPUs through the upstream NVIDIA device plugin.
    For fractional sharing (per-pod memory and compute quotas), see
    [GPU Sharing with HAMi](/docs/next/kubernetes/gpu-sharing/), currently documented for
    tenant Kubernetes clusters, where enabling HAMi automatically disables the GPU Operator's
    built-in device plugin to avoid resource-registration conflicts. Stacking the
    `cozystack.hami` package directly on top of the `container` variant on the management
    cluster is not a supported combination yet: the variant pins the NVIDIA device plugin on,
    and running it alongside HAMi's device plugin causes both to register `nvidia.com/gpu`.

The intro at line 10 ("you can stack HAMi on top once the container variant is up") echoes the same claim and should be softened to match.
    ## Prerequisites

    - A Cozystack management cluster with at least one GPU-enabled node.
    - The GPU node runs a supported Linux distribution (Ubuntu, Debian, RHEL, Fedora, openSUSE) with the NVIDIA driver installed via the distro package manager. Verify with `nvidia-smi` over SSH or `kubectl debug node/<node-name>`; it must enumerate the physical GPUs and report a working driver version.
The companion PR's own OS-support table (docs/gpu-vgpu.md in #2766) only covers Ubuntu 20.04–26.04 and Talos. Cozystack's documented node-OS surface is Talos + Ubuntu/Debian (ansible path). Listing RHEL/Fedora/openSUSE as "supported" presents untested territory as fact.
    - The GPU node runs Ubuntu or Debian with the NVIDIA driver installed via the distro
      package manager (other distros with an equivalent driver + toolkit package layout
      should work the same way but are not regularly tested). Verify with `nvidia-smi` …
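A node can be checked against that tested baseline by reading its os-release file. This is a minimal sketch: the Ubuntu/Debian accept-list comes from the suggestion above, and `distro_id` and `is_tested_distro` are hypothetical helper names.

```bash
# Prints the distro ID from an os-release file (default: /etc/os-release).
distro_id() {
  local file="${1:-/etc/os-release}"
  # Source in a subshell so os-release variables do not leak into the caller.
  ( . "$file" && printf '%s\n' "$ID" )
}

# Succeeds only for the distros named as the tested baseline in the review.
is_tested_distro() {
  case "$(distro_id "$1")" in
    ubuntu|debian) return 0 ;;
    *)             return 1 ;;
  esac
}
```

On the GPU node, `is_tested_distro || echo "untested distro: $(distro_id)"` flags anything outside the baseline without blocking it.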
    - A Cozystack management cluster with at least one GPU-enabled node.
    - The GPU node runs a supported Linux distribution (Ubuntu, Debian, RHEL, Fedora, openSUSE) with the NVIDIA driver installed via the distro package manager. Verify with `nvidia-smi` over SSH or `kubectl debug node/<node-name>`; it must enumerate the physical GPUs and report a working driver version.
    - `nvidia-container-toolkit` installed on the same node and registered with containerd (`grep nvidia /etc/containerd/config.toml` shows the runtime entry).
apt install nvidia-container-toolkit alone does not modify containerd config — registration is a separate manual step. A reader on a fresh node will fail this grep with no pointer to the fix. Suggest spelling out the registration:
- `nvidia-container-toolkit` installed on the same node and registered with containerd:
  ```bash
  sudo nvidia-ctk runtime configure --runtime=containerd
  sudo systemctl restart containerd
  grep nvidia /etc/containerd/config.toml   # must show the runtime entry
  ```
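A plain `grep nvidia` also matches comments and unrelated keys, so a slightly stricter check can help. The sketch below looks for a runtime table named `nvidia`; the pattern assumes the entry appears as a `...containerd.runtimes.nvidia` TOML table, and `has_nvidia_runtime` is a hypothetical helper name.

```bash
# Succeeds if the given containerd config declares a runtime table named
# "nvidia" (either the table itself or a sub-table such as ".options").
has_nvidia_runtime() {
  local config="${1:-/etc/containerd/config.toml}"
  grep -Eq 'runtimes\.nvidia(\]|\.)' "$config"
}
```

If the entry lives in a different file on a given node, pass that path as the first argument, for example `has_nvidia_runtime /path/to/config.toml && echo registered`.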
    ```bash
    kubectl apply -f cuda-smoke.yaml
    kubectl logs cuda-smoke
Run back-to-back, kubectl logs errors while the (large) CUDA base image is still pulling. Add a wait:
    kubectl apply -f cuda-smoke.yaml
    kubectl wait --for=jsonpath='{.status.phase}'=Succeeded pod/cuda-smoke --timeout=5m
    kubectl logs cuda-smoke
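One caveat with waiting on `Succeeded`: if the pod ends in `Failed`, a wait on that single phase value keeps waiting until the timeout expires. A polling loop that also stops on `Failed` is one possible alternative. In this sketch `wait_for_phase` and `pod_phase` are hypothetical helpers, and the pod name `cuda-smoke` comes from the commands above.

```bash
# Polls a phase-reporting command until it prints Succeeded (exit 0),
# prints Failed (exit 2), or the timeout in seconds elapses (exit 1).
wait_for_phase() {
  local timeout="$1" interval="$2"
  shift 2
  local deadline phase
  deadline=$(( $(date +%s) + timeout ))
  while :; do
    # A transient command error is treated as "phase unknown" and retried.
    phase="$("$@" 2>/dev/null || true)"
    case "$phase" in
      Succeeded) return 0 ;;
      Failed)    echo "pod reached phase Failed" >&2; return 2 ;;
    esac
    if [ "$(date +%s)" -ge "$deadline" ]; then
      echo "timed out waiting for Succeeded (last phase: ${phase:-unknown})" >&2
      return 1
    fi
    sleep "$interval"
  done
}

# Reads the current phase of a pod; needs access to the cluster.
pod_phase() {
  kubectl get pod "$1" -o jsonpath='{.status.phase}'
}
```

Usage would look like `wait_for_phase 300 5 pod_phase cuda-smoke && kubectl logs cuda-smoke`. Taking the phase command as an argument keeps the loop testable without a cluster.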
    - The GPU node runs a supported Linux distribution (Ubuntu, Debian, RHEL, Fedora, openSUSE) with the NVIDIA driver installed via the distro package manager. Verify with `nvidia-smi` over SSH or `kubectl debug node/<node-name>`; it must enumerate the physical GPUs and report a working driver version.
    - `nvidia-container-toolkit` installed on the same node and registered with containerd (`grep nvidia /etc/containerd/config.toml` shows the runtime entry).
    - `kubectl` configured against the management cluster.
Minor gotcha worth one prerequisite line: the container variant relies on the upstream default workload container for unlabeled nodes. A node still carrying nvidia.com/gpu.workload.config=vm-passthrough from the GPU Passthrough guide overrides that per-node and the device plugin won't serve it — a likely trip-up when migrating a node off the passthrough setup.
    - The GPU node must not carry a `nvidia.com/gpu.workload.config` label left over from the
      passthrough setup (`kubectl label node <node-name> nvidia.com/gpu.workload.config-` to remove).

Document the new container variant of cozystack.gpu-operator, paired with cozystack/cozystack#2766.

Covers the apt-installed-driver-and-toolkit Linux shape that the variant targets: when to pick it over the passthrough and vGPU variants, prerequisites (host driver + host nvidia-container-toolkit registered with containerd via nvidia-ctk runtime configure, validated with nvidia-smi over kubectl debug), the host-driver reuse path (driver.enabled=false, so the operator uses the pre-installed driver at its standard location with no driverInstallDir override needed on a stock apt install), the Talos caveat with a pointer to the values-native-talos.yaml reference, install steps, a sample CUDA pod for verification, the variant comparison matrix, and a note on why stacking HAMi directly on the container variant on the management cluster is not a supported combination yet (both register nvidia.com/gpu).

Lands under operations/, symmetric with virtualization/gpu.md (VM passthrough on management cluster) and kubernetes/gpu-sharing.md (HAMi in tenant Kubernetes addons).

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
b9cae43 to f2ae9b7
Thanks, addressed in the latest push.
- HAMi (the blocker): rewritten. You're right: HAMi ships its own device plugin, the operator-device-plugin auto-disable lives only in the tenant
- OS support: narrowed to Ubuntu/Debian as tested; RHEL/Fedora/openSUSE are no longer presented as supported, just "should work but not regularly tested."
- containerd registration: spelled out with the explicit
- Leftover
- CUDA smoke pod: added
- Validator path: same reframe as the code PR: dropped
- On the bot's namespace suggestions (
- Sequencing: agreed, this should land with / after cozystack/cozystack#2766. The page is in the
myasnikovdaniil left a comment
NOT LGTM — the practical advice in the bundles.enabledPackages warning is right, but its stated failure mechanism is factually wrong and will mislead operators.
Business context: documents the container variant of cozystack.gpu-operator for running CUDA pods on management-cluster nodes that already ship the NVIDIA driver + container toolkit from the distro package manager.
Status of the requested changes (2026-06-08 review)
- ✅ HAMi device-plugin conflict: the Fractional GPU sharing section now explains `cozystack.hami` and the `container` variant both register `nvidia.com/gpu` and aren't a supported combination.
- ✅ OS support scope: Ubuntu/Debian primary, other distros "not regularly tested."
- ✅ containerd `nvidia` runtime registration: `nvidia-ctk runtime configure` + restart + verify present.
- ✅ leftover `nvidia.com/gpu.workload.config` label: prerequisite bullet with removal command added.
- ✅ CUDA smoke-pod: `kubectl wait … Succeeded` added before `kubectl logs`.
- ✅ host-driver / `driver.enabled=false` path: reframed clearly; Talos caveat points at the reference values file.
Outstanding
B1 (blocker) — bundles.enabledPackages warning states the wrong failure mechanism — inline at line 41. The text says the bundle "hardcodes spec.variant: default" and "any user Package CR with variant: container is overwritten on the next reconcile." Neither is what happens: iaas.yaml renders the GPU operator via cozystack.platform.package with $gpuVariant = bundles.iaas.gpuOperatorVariant | default "default", and fails the Helm render if that value isn't default/vgpu. So container via the bundle path is a hard render error, not a silent overwrite. Keep the conclusion (use a standalone Package CR); fix the reason. Suggested wording inline.
Non-blocking:
- #2766 passed `helm template` + unit tests but no hardware CUDA run; a "provisional pending hardware validation" note would help calibrate trust.
- Prerequisite ordering: the `nvidia.com/gpu.workload.config` removal bullet sits after the containerd-registration block; a node migrating from the passthrough guide would remove the label before/with toolkit registration.
Analysis — where the issues come from
- Original code: B1 (wrong bundle-mechanism text) and the ordering nit are both in the initial commit `f2ae9b7`.
- Introduced by post-review fixes: none. The branch is a single commit; no regressions added.
- Unresolved from the previous review: none. All six asks addressed.
    ## 1. Install the GPU Operator (container variant)

    **Do not** add `cozystack.gpu-operator` to `bundles.enabledPackages` for this variant. The platform Helm chart's optional-package template hardcodes `spec.variant: default` for every name in `enabledPackages` and reconciles the resulting `Package` CR under Helm ownership; any user `Package` CR with `variant: container` is overwritten on the next reconcile. Apply the `Package` CR directly instead; the cozystack platform controller installs it without the bundle entry.
The stated reason here is incorrect, though the practical advice is right. gpu-operator in the iaas bundle does not go through the cozystack.platform.package.optional.default helper and does not hardcode spec.variant: default. iaas.yaml renders it via cozystack.platform.package with $gpuVariant = bundles.iaas.gpuOperatorVariant | default "default", and immediately fails the Helm render if that value is anything other than "default" or "vgpu":
{{- if not (or (eq $gpuVariant "default") (eq $gpuVariant "vgpu")) -}}
{{- fail (printf "bundles.iaas.gpuOperatorVariant must be \"default\" or \"vgpu\", got %q" $gpuVariant) -}}
{{- end -}}
So "container" via the bundle path causes a hard Helm render failure, not a silent overwrite — the user Package CR is never touched because the chart never renders. Suggested replacement:
    Do not add `cozystack.gpu-operator` to `bundles.enabledPackages` for this variant. The `iaas` bundle template only accepts `bundles.iaas.gpuOperatorVariant: default` or `vgpu`; any other value, including `container`, causes a hard Helm render failure (`packages/core/platform/templates/bundles/iaas.yaml`). Apply the `Package` CR directly instead; the platform controller installs it without a bundle entry and without the variant restriction.
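To make the failure mode concrete, the accept-list in the template guard quoted above can be mirrored in shell. This is an illustration of the logic only, not how the chart is implemented; `validate_bundle_gpu_variant` is a hypothetical helper name.

```bash
# Mirrors the accept-list in the quoted template guard: only "default" and
# "vgpu" pass. An empty value falls back to "default", like the template's
# `| default "default"` pipe.
validate_bundle_gpu_variant() {
  local variant="${1:-default}"
  case "$variant" in
    default|vgpu)
      printf '%s\n' "$variant"
      ;;
    *)
      printf 'bundles.iaas.gpuOperatorVariant must be "default" or "vgpu", got "%s"\n' \
        "$variant" >&2
      return 1
      ;;
  esac
}
```

Calling `validate_bundle_gpu_variant container` returns non-zero with the error on stderr, which corresponds to the render aborting before any `Package` CR is produced.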
What this PR does

Add a new operations guide describing the `container` variant of `cozystack.gpu-operator`: the architectural mode for containerized GPU workloads (CUDA pods, ML training, inference) on Linux GPU nodes that already ship the NVIDIA driver and `nvidia-container-toolkit` via the distro package manager.

The new page lands at `content/en/docs/next/operations/gpu-container-workloads.md` and rounds out the GPU documentation surface:
`default` variant).
`container` variant).

Content covers when to pick the variant (host driver + host toolkit + a containerd-registered `nvidia` runtime prerequisite), the host-driver reuse path (`driver.enabled=false`, so the operator uses the pre-installed driver at its standard location with no `driverInstallDir` override on a stock apt install), the Talos caveat with a pointer to the `examples/values-native-talos.yaml` reference, install steps with `Package` CR `variant: container`, a sample CUDA pod for verification, why stacking HAMi directly on this variant is not supported yet, and a three-row variant comparison matrix.

Companion to cozystack/cozystack#2766, which adds the `container` variant itself.

Release note